SPEECH BANDWIDTH EXTENSION

CROSS REFERENCE TO RELATED APPLICATIONS 

This application claims the benefit of U.S. Provisional Application No. 
60/260,922, filed January 12, 2001 (Attorney Docket No. 040071-531), which is 
hereby incorporated herein by reference in its entirety. 

BACKGROUND 

By far the most common way to receive speech signals is directly face-to-face,
with only the ear setting a lower frequency limit around 20 Hz and an upper
frequency limit around 20 kHz. The common telephone narrowband speech signal
bandwidth of 0.3 - 3.4 kHz is considerably narrower than what one would
experience in a face-to-face encounter with a sound source, but it is sufficient to
facilitate the reliable communication of speech. However, there would be a
benefit to be obtained by extending this narrowband speech signal to a wider
bandwidth in that the perceived naturalness of the speech signal would be
increased.

Bandwidth extension methods previously suggested include codebook
approaches (see, e.g., Y. Yoshida, M. Abe, An algorithm to reconstruct wide-band
speech from narrowband speech based on codebook mapping, Conf. Proc. ICSLP
94, pp. 1591-1594, Yokohama, 1994; and J. Epps, W.H. Holmes, Speech
enhancement using STC-based bandwidth extension, Conf. Proc. ICSLP, 1998)
and aliasing/folding approaches (see, e.g., J. Makhoul, M. Berouti, High
frequency regeneration in speech coding systems, Conf. Proc. ICASSP, pp. 428-431,
Washington, USA, 1979; and H. Yasukawa, Quality enhancement of band
limited speech by filtering and multirate techniques, Conf. Proc. ICSLP 94, pp.
1607-1610, Yokohama, 1994). The aliasing approach is generally simple in
structure. In this approach, the narrowband signal is up-sampled by inserting 
zeros between the narrow-band signal samples. When using such up-sampling, a 
reconstruction lowpass filter having a cut-off frequency at half the new sampling 
rate is used. When a shaping filter is substituted for this filter, the aliased/folded 
frequency content in the upper-frequency region extends the speech content. The 
drawbacks of this technique are that a harmonic speech structure is not continued 
in the upper-frequency region, and that a suitable amplitude level of the upper- 
frequency-band is generally not achieved for all speech sounds. 
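
As a rough illustration of the aliasing/folding approach described above, the
following sketch (Python with NumPy, used here only for illustration) inserts
zeros between samples and omits the reconstruction filter. The function name,
the 8 kHz sampling rate and the 1 kHz test tone are assumptions made for the
example and are not taken from the cited references.

```python
import numpy as np

def zero_insert_upsample(x, factor=2):
    """Insert (factor - 1) zeros between samples; no reconstruction filter."""
    y = np.zeros(len(x) * factor, dtype=float)
    y[::factor] = x
    return y

fs = 8000                                   # assumed narrow-band sampling rate
n = np.arange(800)
x = np.sin(2 * np.pi * 1000 * n / fs)       # assumed 1 kHz test tone
y = zero_insert_upsample(x, 2)              # now sampled at 16 kHz

spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1.0 / (2 * fs))
peaks = freqs[spectrum > 0.5 * spectrum.max()]   # frequencies of strong bins
```

With these assumed values the tone appears both at 1 kHz and, mirrored about
4 kHz, at 7 kHz. A shaping filter would then weight the folded content in the
upper band.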

The codebook approach is a more advanced solution, in which the narrow 
frequency-band is analyzed with a codebook look-up method. The codebook index 
is matched one-to-one with a filter that is suitable for shaping an excitation signal. 
The excitation signal can, for example, be created with an aliasing/folding method. 
The codebook approach has also been tested for the lower frequency-band (see,
e.g., the Y. Yoshida and M. Abe reference cited above).

Speech signals are generally described by a short-time-segments model 
comprising a filter and a signal excitation. The filter describes the human vocal 
tract and the coupling between the excitation source and the vocal tract. The 
sound radiation characteristics from the mouth may also be included in this filter. 
Generally, it is sufficient to use an all-pole filter to estimate the vocal tract,
coupling, and radiation characteristics. This filter then will only vaguely
approximate zeros introduced by, for example, a nasal tract, or lateral consonants.
This estimation problem can be reduced by increasing the filter order. 

Speech signals are considered to be stationary during segments of 10-30 
ms. This segment duration is determined by the fact that it takes approximately 70 
ms for tissue in the vocal tract to change from one end-position to another. Hence, 
the vocal tract and the speech sounds can be completely different after this 
interval, but rarely after shorter durations of time. 



During voiced speech segments, the poles of the filter can be described as 
estimates of the formants of speech, and also the coupling between the formant and 
the excitation source. The formants are the resonance frequencies of the vocal 
tract, either the whole or parts of it. Hence, the amplitude level at these formant 
frequencies is larger compared to adjacent frequencies, assuming the vocal folds 
source is present. 

During unvoiced speech segments, the poles of the filter do not describe the 
formants, although the poles of the filter describe the resonance frequencies of the 
vocal tract, or more correctly the oral tract. The unvoiced speech is generated 
with almost no use of the lower part of the vocal tract. The number of noticeable 
resonances is often limited to one or two in the oral tract because of the short 
length of the cavity. Another aspect of the short resonators common for unvoiced 
speech segments is that the speech content is high in frequency, generally having 
prominent and perceptually important content above 3.4 kHz. 

The sources that excite the filter can be divided into two types: the quasi- 
periodic and the turbulent noise source. The vocal folds in the larynx are the main 
source during voiced speech segments. This source is of a quasi-periodic type, 
normally having a fundamental frequency in the range of 70-400 Hz. This 
fundamental frequency is also called the pitch frequency, and a person can, during 
speech, increase the pitch frequency by about 100% compared to a relaxed state. 
The signal generated by the vocal folds looks like a skewed half-wave rectified
sinusoid, and thereby also generates harmonics. The harmonics are perceptually
important due to the fact that formants are grouped according to their excitation's 
fundamental frequency; that is, formants having the same fundamental frequency 
will form a speech sound. It has been shown that in concurrent speech 
environments the fundamental frequency is even more important than the direction 
of the sound. 

The turbulent noise source is generated by steering, with a constriction, an
air stream against an obstacle or only causing a turbulent air volume velocity.
When an obstacle is used, the resulting noise amplitude level is higher. Noise
sources can be generated at many locations in the vocal tract, but the most
prominent ones are generated in the oral cavity.

The perception of speech by the human hearing mechanism has some 
important functionalities. Human hearing is commonly described as having a 
logarithmic sensitivity with respect to both frequency and amplitude level. As a 
result, low frequencies carry more information in smaller frequency-bands. One 
way of describing this is the Bark scale, having frequency bands of 100 Hz in the
lower frequency region and approximately 1 kHz in the upper frequency region. 
The amplitude level is often presented in decibels since this logarithmic scale is 
quite consistent with the amplitude level sensitivity of human hearing, or the 
loudness perception. 

SUMMARY 

It should be emphasized that the terms "comprises" and "comprising", 
when used in this specification, are taken to specify the presence of stated features, 
integers, steps or components; but the use of these terms does not preclude the 
presence or addition of one or more other features, integers, steps, components or
groups thereof. 

It is desirable to facilitate a perceptually acceptable extension of the 
narrow-band speech signal (300-3400 Hz) into a wide-band speech signal (50-3400 
Hz). 

In accordance with one aspect of the invention, it is possible to expand the 
narrow-band speech signal downward into a lower frequency band than is found in 
the narrow band speech signal. Accomplishing this includes analyzing the first 
narrow-band speech signal to generate one or more parameters; synthesizing a 
lower frequency-band signal based on at least one of the one or more parameters; 
and combining the synthesized lower frequency-band signal with a second narrow-
band speech signal that is derived from the first narrow-band speech signal. In
some embodiments, the second narrow-band speech signal is generated by a
technique that includes up-sampling the narrow-band speech signal.

To facilitate synthesizing the lower frequency-band signal, the one or more 
parameters include a pitch frequency parameter. Synthesizing the lower 
frequency-band signal based on at least one of the one or more parameters includes 
generating continuous sine tones that are based on the pitch frequency parameter. 
In some embodiments, the narrow-band speech signal comprises a plurality of 
narrow-band speech signal segments. In such cases, the pitch frequency parameter 
can be estimated for each of the narrow-band speech signal segments; and the 
continuous sine tones can be changed gradually during a first part of each speech 
signal segment. 

In another aspect, synthesizing the lower frequency-band signal based on at 
least one of the one or more parameters may further comprise adaptively changing 
an amplitude level of the continuous sine tones based on an amplitude level of at 
least one formant in the narrow-band speech signal segment. The at least one 
formant in the narrow-band speech signal segment is preferably a first formant in 
the narrow-band speech signal segment. 

In yet another aspect, synthesizing the lower frequency-band signal based 
on at least one of the one or more parameters can further comprise lowpass 
filtering the continuous sine tones. This lowpass filtering of the continuous sine 
tones is preferably performed with an upper cutoff frequency substantially equal to 
300 Hz. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The objects and advantages of the invention will be understood by reading 
the following detailed description in conjunction with the drawings in which: 

FIG. 1 is a block diagram of an exemplary technique for extending the 
bandwidth of a speech signal, in accordance with the invention; 



FIG. 2 is a block diagram of an upper-band speech synthesizer, in 
accordance with an aspect of the invention; 

FIG. 3 is a block diagram of a lower-band speech synthesizer, in 
accordance with an aspect of the invention; and 

FIG. 4 is a block diagram of a narrow-band speech analyzer, in accordance
with an aspect of the invention. 

DETAILED DESCRIPTION 

The various features of the invention will now be described with reference 
to the figures, in which like parts are identified with the same reference characters. 

The various aspects of the invention are described in connection with a 
number of exemplary embodiments. To facilitate an understanding of the 
invention, many aspects of the invention are described in terms of sequences of 
actions to be performed by elements of a computer system. It will be recognized 
that in each of the embodiments, the various actions could be performed by 
specialized circuits (e.g., discrete logic gates interconnected to perform a 
specialized function), by program instructions being executed by one or more 
processors, or by a combination of both. Moreover, the invention can additionally be
considered to be embodied entirely within any form of computer readable carrier, 
such as solid-state memory, magnetic disk, optical disk or carrier wave (such as 
radio frequency, audio frequency or optical frequency carrier waves) containing an 
appropriate set of computer instructions that would cause a processor to carry out 
the techniques described herein. Thus, the various aspects of the invention may be 
embodied in many different forms, and all such forms are contemplated to be 
within the scope of the invention. For each of the various aspects of the invention, 
any such form of embodiments may be referred to herein as "logic configured to" 
perform a described action, or alternatively as "logic that" performs a described 
action. 



Since in the beginning, few telephones will have the wide-band vocoder 
facility, a technique is presented herein for expanding the common narrow-band 
speech signal into a wide-band speech signal using only the equipment in the 
receiving telephone. This will give the impression of a wide-band speech signal 
regardless of which vocoder is used. The robust technique described herein is 
based on speech acoustics and fundamentals of human hearing. That is, during 
voiced speech segments, the harmonic structure of the speech signal is extended, 
and the correct amount of speech energy relative to the energy of the common 
narrow frequency-band is introduced. During unvoiced speech segments, a fricated
noise may be introduced in the upper frequency-band. 

The bandwidth extension method can be divided into an analysis part and a 
synthesis part as shown in FIG. 1. In the exemplary embodiment depicted in FIG. 
1, the analysis part comprises a narrow-band speech analyzer 101, which takes the 
common narrow-band signal as its input and generates the parameters that control 
the synthesis part. The synthesis part may comprise either an upper-band speech 
synthesizer 103, a lower-band speech synthesizer 105, or both as depicted in FIG. 
1. The synthesis part generates the extended bandwidth speech signals, y_high(n)
and/or y_low(n), which have a higher sampling rate (e.g., two times higher) than that
of the input signal, x(n). In order to permit it to be combined with the synthesized
signals, the original input signal is up-sampled by an up-sampling unit 107. The
output of the up-sampling unit 107, x_2(n), is then combined with the extended
bandwidth speech signals, y_high(n) and y_low(n), by a combining unit 109, which
generates the resultant output signal, y(n).

The upper-band speech synthesizer 103 comprises an excitation spectrum 
extender and filters that shape the speech content in the upper frequency-band as 
shown in FIG. 2. The excitation spectrum is expanded by using a spectrum 
equalizer 201 to equalize the amplitudes of the entire narrow-band speech 
spectrum, selected parts of which are then copied by a spectrum copy unit 203. 
This results in a signal having a higher sampling rate than that of the
input signal (for example, twice the sampling rate, although this could differ in
other embodiments). The copying is performed such that a harmonic structure is
continued. The resultant excitation signal, D, is then shaped by a bandpass filter
205 having a fixed configuration. The output of the bandpass filter 205 is a
bandpass-filtered signal, DH_high. The purpose of the bandpass filter 205 is to
introduce a descending amplitude level for higher frequencies and to cut off the
frequency region below the upper-band. The gain of the extended spectrum is
controlled by signals (A_{k,m} and CTRL) generated by the narrow-band speech
analyzer 101. The resultant excitation signal, D, is supplied to each of a voiced
gain unit 207 and an unvoiced gain unit 209, which generate therefrom the
respective gain signals g_v and g_u based on the amplitude control signal A_{k,m}. A
third gain signal, g_0, is also provided. The third gain signal, g_0, is preferably a
very low constant gain factor that is used when the corresponding speech is neither
voiced nor fricated; that is, when no actual speech is present in the speech signal, or
when a speech sound is present in the speech signal but does not have significant
high-band speech content, as in the closure part of stop consonants. An aspect of
the CTRL signal selects which of the three gain signals (g_v, g_u and g_0) will be
used to adjust the amplitude of the bandpass-filtered signal DH_high.

In another aspect of the invention, the amplitude spectrum shape can be
further controlled more specifically with a formant filter 211, whose transfer
function resembles a formant structure. The formant filter 211 operates on the
bandpass-filtered signal DH_high, using filter characteristics provided by a formant
filter control signal F_Uq, which is provided by the narrow-band speech analyzer
101. The formant filter 211 preferably has several peaks in the upper frequency-
band. The formant peaks are preferably placed at equal frequency distances,
having the same distance as the two highest formant peaks found in the narrow
frequency-band. The output of the formant filter 211 is a formant-filtered signal
DVH_high. An aspect of the CTRL signal (provided by the narrow-band speech
analyzer 101) controls whether the bandpass-filtered signal DH_high or alternatively
the formant-filtered signal DVH_high will be amplified by one of the three gain
signals (g_v, g_u and g_0) to generate the extended bandwidth speech signal, y_high(n).
These and other aspects of the upper-band speech synthesizer 103 are described in
greater detail later in this description in connection with an exemplary embodiment
of the invention.

As mentioned earlier, in conjunction with (or alternatively in lieu of) the 
bandwidth expansion upward in frequency, it is also possible to expand the 
bandwidth downward in frequency. The lower-band speech synthesizer 105, 
which serves this purpose, is shown in greater detail in FIG. 3. The narrow 
telephone bandwidth provided in conventional systems has a lower cut-off 
frequency of 300 Hz. The resolution of human hearing in frequency is 
logarithmic. Translating the bandwidths to the Bark scale (a traditional logarithmic
frequency scale), the 50-300 Hz and 3400-7000 Hz regions become approximately
three and four Bark bands wide, respectively. This implies that the lower region is
also perceptually important. The speech content in this lower frequency region
mostly comprises the pitch and its harmonics during voiced speech segments.
During unvoiced speech segments, the lower frequency region is not perceptually
important. The technique employed for estimating the speech content in this
region, in accordance with this aspect of the invention, is to introduce sine tones
at the pitch frequency and the harmonics up to 300 Hz. Generally, the number of
tones is four or fewer, since the pitch frequency is above 70 Hz. This is described
in greater detail below. 
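
The Bark-band widths quoted above can be checked approximately with a short
calculation. The sketch below assumes the Zwicker-Terhardt closed-form
approximation of the Bark scale; the description does not name a particular
formula, so the function and its constants are an assumption made for the
example.

```python
import math

def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the Bark scale (assumed formula)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

# Width of the two extension regions, expressed in Bark bands.
low_width = hz_to_bark(300.0) - hz_to_bark(50.0)
high_width = hz_to_bark(7000.0) - hz_to_bark(3400.0)
```

Under this assumed formula the lower region comes out at roughly 2.4 Bark and
the upper region at roughly 4.2 Bark, which is the same order as the three and
four bands stated above. Although the lower region spans only 250 Hz, its
perceptual width is therefore comparable to that of the upper region.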

The analysis part of the bandwidth expansion method mainly involves use
of a pitch frequency estimator 401, a pitch activity detector (PAD) 403, a fricated
speech detector (fricated activity detector, FAD) 405 and a formant peaks
amplitude estimator (e.g., blocks 407, 409, 411 and 413, as described below), as
shown in FIG. 4. The pitch activity detector 403 is used to decide the amount of
gain to be used on the extended excitation spectrum. The general behavior of the
narrow-band speech analyzer 101 is that fricated speech segments are preferably
given a larger gain since, for example, fricatives have a substantial part of the
speech energy in the upper frequency region. The pitch-frequency estimator 401
is used to calculate which frequencies the sine tones introduced in the lower
frequency region should have.

The formant peaks amplitude estimation is accomplished by estimating a
linear predictor filter 407. The output of the linear predictor filter 407 is also used
to calculate the excitation signal in the spectrum equalizer 201. The narrowband
speech signal, x, is modeled by an all-pole filter a and an excitation signal e,

\[ x(n) = e(n)\,a(0) + e(n-1)\,a(1) + \ldots + e(n-p)\,a(p), \tag{1} \]

where p is the filter order. Equation (1) is valid during stationary signal
conditions, which is approximately the case for individual speech segments. The
model is then changed for each speech segment. The filter coefficients, a(n), are
supplied to a pole frequency calculation unit 409 and to an amplitude calculation
unit 411. The amplitude calculation unit 411 uses the filter coefficients a(n) and
the pole frequency values, F_Nq, to calculate the amplitude values at the
frequencies of the complex-conjugated poles. Differently scaled versions of these
amplitude values are then generated. In one version, the amplitude values are
multiplied by a constant, C_l, to yield values, denoted g_l(m), for use in the lower-
band speech synthesizer 105. In another version, the amplitude levels are scaled
by a logarithm scaling unit 413 to give a relatively more perceptually correct
amplitude level, denoted herein as A_{k,m}, where k is both the estimated formant
frequency number (e.g., 1, 2, 3, 4, ...) and the complex-conjugated pole-pair
index (these should be the same) and m is the index separating the M segments,
and is not a running segment number. The voiced gain unit 207 and unvoiced
(fricated) gain unit 209 in the upper-band speech synthesizer 103 calculate their
respective gain values by linearly combining the logarithmic amplitude levels,
A_{k,m}. Different combinators are used for voiced and fricated (unvoiced) speech
segments. The gain is used to amplify the excitation spectrum, as explained
earlier. Within the narrow-band speech analyzer 101, a fricated speech activity
detector (FAD) 405 uses other linear combinations of the logarithmic amplitude
levels, A_{k,m}, to detect fricated speech sound. A voice activity detector 415 is
further provided in the narrow-band speech analyzer 101 to generate a signal that
indicates the presence or absence of speech in the input signal, x(n). The outputs
of the pitch activity detector 403, the voice activity detector 415 and the fricated
speech activity detector 405 are supplied to control logic 417 that generates the
CTRL signals that are supplied to the upper-band speech synthesizer 103.
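
The analysis chain formed by the linear predictor 407, the pole frequency
calculation unit 409, the amplitude calculation unit 411 and the logarithm
scaling unit 413 might be sketched as follows. This is a minimal sketch under
stated assumptions: it uses the conventional autocorrelation method with the
Levinson-Durbin recursion, and it takes the amplitude at a pole frequency to be
the prediction gain divided by the magnitude of the predictor polynomial at
that frequency. Neither choice is stated in the description, and all function
names and test values are hypothetical.

```python
import numpy as np

def lpc_autocorr(x, order):
    """Predictor polynomial A(z) = 1 + a1 z^-1 + ... (Levinson-Durbin)."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                   # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                           # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def pole_frequencies_and_log_amplitudes(a, err, fs):
    """Frequencies of the complex-conjugate pole pairs and log10 amplitudes."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one pole of each pair
    roots = roots[np.argsort(np.angle(roots))]
    freqs = np.angle(roots) * fs / (2 * np.pi)
    idx = np.arange(len(a))
    log_amps = []
    for f in freqs:
        A_f = np.sum(a * np.exp(-2j * np.pi * f / fs * idx))
        # Assumed amplitude measure: prediction gain over |A| at the pole.
        log_amps.append(np.log10(np.sqrt(err) / np.abs(A_f)))
    return freqs, np.array(log_amps)

# Hypothetical test segment: white noise through one resonance at 1 kHz.
fs = 8000
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
r_pole, f0 = 0.95, 1000.0
a1 = -2 * r_pole * np.cos(2 * np.pi * f0 / fs)
a2 = r_pole ** 2
x = np.zeros_like(e)
for n in range(len(e)):
    x[n] = e[n]
    if n >= 1:
        x[n] -= a1 * x[n - 1]
    if n >= 2:
        x[n] -= a2 * x[n - 2]

a_est, err = lpc_autocorr(x, order=2)
pole_freqs, log_amps = pole_frequencies_and_log_amplitudes(a_est, err, fs)
```

For this synthetic segment the single estimated pole frequency lies close to
the 1 kHz resonance that was used to generate it.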

The pole frequency calculation unit 409 also supplies its output
frequencies, F_Nq, to an upper formants synthesizer 419, which generates
synthesized formants, F_Uq, for use in the upper-band speech synthesizer 103.
Generation of the synthesized upper formants, F_Uq, is described in greater detail
below.

As mentioned earlier, the lower speech synthesized signal, y_low(n), and
upper speech synthesized signal, y_high(n), are combined with (e.g., added to) the
up-sampled narrow-band signal, x_2(n), to generate the final wideband speech signal:

\[ y(n) = y_{low}(n) + y_{high}(n) + x_2(n). \tag{2} \]

Upper-Band Speech Synthesizer 103

The upper-band speech synthesizer 103 will now be described in greater 
detail in connection with an exemplary embodiment. The upper frequency-band 
that is generated in this exemplary embodiment has a frequency range of 3.4-7
kHz, although this could differ in other embodiments. This frequency range 
generally includes the fourth through eighth formants during voiced speech 
segments, but the highest are often not perceptually important. An unvoiced 
speech segment that includes, for example, a fricative or an affricate consonant has 
a substantial part of its speech energy in this frequency region. 

Referring back now to FIG. 2, the excitation signal, e(n) (which is
generated from the original signal x(n) by means of the filtering that is performed
by the inverse linear predictor filter) is first extended upwards in frequency. One
simple and robust method to accomplish this is to copy the spectrum from lower
frequencies to higher frequencies. During this copying, it is very important to
continue any harmonic structure. The spectrum of the excitation, E(f), is divided
into three zones: the lower match zone, E(f_l); the middle zone, E(f_m); and the
upper match zone, E(f_u). The amplitude spectrum of the excitation, |E(f)|, will
have a comb-like structure with the peaks at a distance of the pitch frequency
during voiced speech segments. The spectrum equalizer 201 calculates the full
complex spectrum on a grid of frequencies, f_i, i = 0, ..., I - 1, with a Fast Fourier
Transform (FFT), where I represents the number of frequency bins in the
grid. The frequencies f_i are examined for the maximum spectrum amplitude,
|E(f_i)|, in each of the zones f_l and f_u:

\[ |E(f_{l,\max})| = \max |E(f_i)|, \quad f_i \in f_l, \]
\[ |E(f_{u,\max})| = \max |E(f_i)|, \quad f_i \in f_u. \tag{3} \]

A harmonic structure is continued since the maximum in the amplitude
spectrum likely coincides with a harmonic tone of the pitch-frequency. When the
speech segment is unvoiced, the technique operates in the same manner, even
though no harmonic structure needs to be continued. Then, to extend the
excitation spectrum into higher frequencies, the spectrum copy unit 203 repeatedly
copies the spectrum between the two found maxima up until f_{I-1} is reached:

\[ D(f) = E(f), \]
\[ D(f + c) = E(f), \quad f \in [f_{l,\max}, f_{u,\max}], \quad c = (1, 2, \ldots) \cdot (f_{u,\max} - f_{l,\max}), \quad f + c \le f_{I-1}. \tag{4} \]

The complex conjugated mirrored part of the spectrum, inherent in real-
valued time signals, is calculated from:

\[ D(-f_i) = D^{*}(f_i), \quad i = 1, 2, \ldots, M. \tag{5} \]

This results in the bandwidth expanded excitation spectrum D having a
doubled sample rate. The spectrum D can also be constructed by means of a
combination of interpolation, filtering and transpositions.

The bandwidth expanded excitation spectrum D is then filtered by a
bandpass filter 205. This yields a filtered expanded excitation spectrum, D_high:

\[ D_{high} = D \cdot H_{high}. \tag{6} \]

In the exemplary embodiment, the bandpass filter 205 has a filtering 
characteristic, H_high (= h_high in the time domain), that has a lower cut-off
frequency of 3400 Hz and a continuously descending level for higher frequencies. 
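
The copying step performed by the spectrum copy unit 203 might look as follows
in outline. This is a minimal sketch, assuming a one-sided spectrum stored as
an array of bins; the function name, the zone boundaries and the comb-shaped
test spectrum are hypothetical values chosen for the example, and the bandpass
filter 205 is not included.

```python
import numpy as np

def extend_spectrum(E, low_zone, high_zone, out_bins):
    """Repeat the span between the two zone maxima up to out_bins bins.

    E is a one-sided complex spectrum; each zone is a (start, stop) bin range.
    """
    lo = low_zone[0] + int(np.argmax(np.abs(E[low_zone[0]:low_zone[1]])))
    hi = high_zone[0] + int(np.argmax(np.abs(E[high_zone[0]:high_zone[1]])))
    period = hi - lo                       # length of the copied span in bins
    D = np.zeros(out_bins, dtype=complex)
    D[:hi + 1] = E[:hi + 1]                # keep the original content
    for i in range(hi + 1, out_bins):
        # Bins above the upper maximum repeat the span (lo, hi] cyclically.
        D[i] = E[lo + 1 + (i - hi - 1) % period]
    return D, lo, hi

# Hypothetical comb spectrum: a harmonic peak on every 10th bin.
E = np.full(200, 0.1, dtype=complex)
E[::10] = 1.0
D, lo, hi = extend_spectrum(E, low_zone=(95, 125), high_zone=(165, 195),
                            out_bins=400)
```

Because both match points fall on harmonic peaks, the copied span is a whole
number of pitch periods long, and the peaks in the extended band continue on
the same grid as those in the narrow band.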

In some embodiments, in order to enhance the perceived speech signal, the 
upper-band speech synthesizer 103 may further include a formant filter 211 which 
gives spectral peaks at estimated formant frequencies in the upper frequency range, 
F_U1, F_U2, .... In the exemplary embodiment, the formant filter 211 has one
complex conjugated pole-pair and one complex conjugated zero-pair for each 
synthetic formant frequency, with the poles having larger amplitudes: 



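\[ V(f) = v_0 \prod_{k} \frac{\left(1 - r_z(k)\, e^{j 2\pi (F_{Uk} - f)}\right)\left(1 - r_z(k)\, e^{-j 2\pi (F_{Uk} + f)}\right)}{\left(1 - r_p(k)\, e^{j 2\pi (F_{Uk} - f)}\right)\left(1 - r_p(k)\, e^{-j 2\pi (F_{Uk} + f)}\right)} \tag{7} \]

where r_z is the constant amplitude of the zeros, r_p is the constant amplitude of the
poles, v_0 is a fixed normalizing gain, and the frequencies f and F_Uk are expressed
as fractions of the sampling rate. The arrangement of the exemplary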
formant filter 211 reduces the interference between the poles compared with a 
filter having only poles. The poles and zeros have lower amplitudes for higher 
formant frequencies in order to bring about an increasing bandwidth for higher 
formant frequencies. The distances in frequency between the formants are 
preferably equal. The equal distance is motivated by the fact that formants in the
higher frequency region are most often resonances in the front-most cavity, or 
tube, of the vocal tract and hence are multiples of a lowest resonance frequency. 
The frequency distance calculation is presented below in the section entitled 
"Narrow-Band Speech Analyzer 101." 

The output, D_vhigh, of the formant filter is thus given by:

\[ D_{vhigh} = V \cdot D_{high}. \tag{8} \]
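
The pole-zero arrangement of the formant filter 211 described above can be
illustrated by evaluating its magnitude response. The sketch below is written
under stated assumptions: the 16 kHz sampling rate, the three synthetic formant
frequencies and the pole and zero radii are hypothetical values, chosen so that
the poles have the larger radii and so that both radii decrease for higher
formants, as the description requires.

```python
import numpy as np

def formant_filter_response(freqs_hz, formants_hz, r_p, r_z, fs, v0=1.0):
    """Response of one conjugate pole pair and zero pair per formant."""
    z_inv = np.exp(-2j * np.pi * np.asarray(freqs_hz, dtype=float) / fs)
    V = np.full(z_inv.shape, v0, dtype=complex)
    for F, rp, rz in zip(formants_hz, r_p, r_z):
        w = 2 * np.pi * F / fs               # formant angle on the unit circle
        num = (1 - rz * np.exp(1j * w) * z_inv) * (1 - rz * np.exp(-1j * w) * z_inv)
        den = (1 - rp * np.exp(1j * w) * z_inv) * (1 - rp * np.exp(-1j * w) * z_inv)
        V *= num / den
    return V

fs = 16000                                   # assumed wide-band sampling rate
formants = [4500.0, 5500.0, 6500.0]          # assumed equally spaced formants
r_p = [0.95, 0.94, 0.93]                     # pole radii (hypothetical)
r_z = [0.85, 0.84, 0.83]                     # zero radii (hypothetical)

# Evaluate at two formant frequencies and at the midpoint between them.
mag = np.abs(formant_filter_response([4500.0, 5000.0, 5500.0],
                                     formants, r_p, r_z, fs))
```

With these assumed radii the response is larger at the two formant frequencies
than at the midpoint between them. Each zero pair sits at the same angle as its
pole pair but closer to the origin, which limits how far the skirt of each
resonance reaches into its neighbours.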
In preferred embodiments, the upper-band speech synthesizer 103 may
alternatively be based on either the bandpass-filtered signal, D_high, or the formant-
filtered signal, D_vhigh. The selection is made by the CTRL signal. Thus, a first
Inverse Fast Fourier Transform unit (IFFT) 213 is provided to convert the
bandpass-filtered signal into the time domain:

\[ d_{high}(n) = \mathcal{F}^{-1}(D_{high}), \tag{9} \]

and a second IFFT 215 is provided to convert the formant-filtered signal into the
time domain:

\[ d_{vhigh}(n) = \mathcal{F}^{-1}(D_{vhigh}). \tag{10} \]

The upper-band speech synthesizer 103 preferably includes a suitable
amplifier 217 that amplifies the extended excitation spectrum by an amount, g,
based on the level in the narrow-band frequency region. The output of the upper-
band speech synthesizer 103 is therefore either:

\[ y_{high}(n) = g \cdot d_{high}(n) \tag{11} \]

or

\[ y_{high}(n) = g \cdot d_{vhigh}(n), \tag{12} \]

depending on the value of the CTRL signal.

The gain, g, is calculated differently, depending on whether the speech
signal in the current speech segment represents voiced or unvoiced speech. When
the current segment contains voiced speech, with a detected pitch, the voiced gain
unit 207 generates a voiced gain signal, g_v, that is derived from the logarithmically
scaled amplitudes at the frequencies of the poles, F_N1, F_N2, ..., F_NN, in the linear
prediction filter:

[Equation (13), which defines the log amplitudes A_{k,m} as a base-10 logarithm
involving the linear predictors a_m(l) and the auto-correlation γ_xx,m(l), is not
legible in the source text.]

\[ f_v = \sum_{k=1}^{p} A_{k,1}\, h_v(k), \tag{14} \]

\[ g_v = 10^{f_v}, \tag{15} \]

where p is the order of the linear predictor filter 407; γ_xx,m is the auto-correlation
of the narrow-band signal over the last M - 1 voiced segments and the current
unvoiced segment; h_v is the linear combinator of the log amplitudes, A_{k,m}; a_m(l)
are the linear predictors over the last M - 1 voiced segments and the current
unvoiced segment; and m = 1 for voiced segments. The logarithm of the
amplitudes is used because this complies with the perception of amplitude levels
and it is likely that the gain level should be dependent on the log amplitudes.

During unvoiced speech segments with fricated speech, the unvoiced gain
signal, g_u, is determined as a function of the log amplitude levels over the last
M - 1 voiced segments and the current unvoiced segment:

\[ f_u = \sum_{m=1}^{M} \sum_{k=1}^{p} A_{k,m}\, h_u(k, m), \tag{16} \]

\[ g_u = 10^{f_u}, \tag{17} \]

where A_{k,m} are the log amplitudes for the last M - 1 voiced segments and the
current segment, and h_u is the linear combinator used for fricated segments. That
is, given a mix of voiced and unvoiced segments, one would have to reach back more
than M - 1 previous segments in order to find the M - 1 most recent voiced
segments. A value of M is preferably determined empirically, with a
value of 10 often being sufficiently high. The final gain, g, is then given by:

\[ g = \begin{cases} g_v, & \text{when voiced} \\ g_u, & \text{when fricated} \\ g_0, & \text{when neither voiced nor fricated} \end{cases} \tag{18} \]



where g_0 is a very low constant gain factor. More particularly, g_0 is preferably at
least 20 dB below the long-time average for the other gains, but more generally it
is a constant that should depend on the application. For example, it may be
preferred, in some applications, to also copy the background sound to the high
band, whereas in other applications a total mute of the background in the high
band may be preferred. In the exemplary embodiment illustrated in FIG. 2, the
selection represented in Equation (18) is made by the CTRL signal.
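
The selection in Equation (18), together with the 20 dB margin for g_0, can be
sketched as below. The sketch assumes that the gains are amplitude (not power)
gains, so that 20 dB corresponds to a factor of ten, and that the voiced
decision takes priority if both flags happen to be set. Neither assumption is
stated in the description, and the function names are hypothetical.

```python
def floor_gain(long_time_average_gain, margin_db=20.0):
    """Constant g_0 placed margin_db below the long-time average gain."""
    return long_time_average_gain * 10.0 ** (-margin_db / 20.0)

def select_gain(voiced, fricated, g_v, g_u, g_0):
    """Pick the gain applied to the extended excitation, as in Equation (18)."""
    if voiced:            # assumed priority when both flags are set
        return g_v
    if fricated:
        return g_u
    return g_0

g_0 = floor_gain(2.0)     # hypothetical long-time average gain of 2.0
g_silence = select_gain(False, False, g_v=1.5, g_u=3.0, g_0=g_0)
g_fricative = select_gain(False, True, g_v=1.5, g_u=3.0, g_0=g_0)
```

Setting margin_db to a very large value approximates the total mute of the
background mentioned above, whereas a smaller margin lets some background sound
through to the high band.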

Lower-Band Speech Synthesizer 105 

The lower-band speech synthesizer 105 will now be described in greater 
detail in connection with an exemplary embodiment, shown in FIG. 3. The lower 
frequency-band that is generated in this exemplary embodiment has a frequency 
range of 50-300 Hz, although this could differ in other embodiments. This 
frequency range mainly has voiced speech content. The excitation spectrum of 
voiced speech is the pitch frequency and its harmonics. The harmonics decrease in 
amplitude with increasing frequency. The excitation spectrum is filtered by a 
formant structure and for the lower frequency range the first formant is of 
importance. The first formant is in the approximate range of 250-850 Hz during 
voiced speech. As a result, the natural amplitude levels of the harmonics in the 
frequency range 50-300 Hz are either approximately equal or have a descending 
slope towards lower frequencies. Low frequency tones are capable of perceptually
masking higher frequencies substantially; this is the so-called upward spread of
masking. This implies that caution must be taken when introducing tones in the
low frequency region. Accordingly, the estimated gain is preferably taken to be
less than the estimated amplitude of the first formant peak. The suggested
bandwidth extension downward in frequency is accomplished by means of a
continuous sine tone generator 301 that introduces continuous sine tones. The
amplitude levels of all the sine tones are adaptively changed, with a fraction of the
amplitude level of the first formant:

\[ g_l(m) = C_l \cdot (\text{amplitude of the first formant peak in segment } m), \tag{19} \]

where C_l is a constant and m is the running segment number.

The low frequency continuous sine tone generators 301 are based on the 
pitch frequency and integer multiples of the pitch frequency. The pitch is 
estimated for each speech segment. To avoid discontinuities in the sine tones, the 
tones are changed gradually during a first part of each segment. For each integer 
multiple, i, of the pitch frequency, the continuous sine tone generator 301 
generates each sine tone signal, $s_i(n)$, in accordance with: 

$$
s_i(n) = \begin{cases}
\left(g_l(m-1) + \dfrac{n\,\bigl(g_l(m) - g_l(m-1)\bigr)}{L_t}\right) \sin\!\left(i\left(\phi(m) + n\left(\omega(m-1) + \dfrac{n\,\bigl(\omega(m) - \omega(m-1)\bigr)}{L_t}\right)\right)\right), & n = 0, \ldots, L_t \\
g_l(m)\,\sin\!\bigl(i\,(\phi(m) + n\,\omega(m))\bigr), & n = L_t + 1, \ldots, L - 1
\end{cases} \qquad (20)
$$

where $\phi(m)$ is the phase compensation needed to maintain a continuous sinusoid 
between segments, $\omega(m)$ is the pitch frequency of the current segment $m$, $L$ is the 
number of samples in the segment, and $L_t$ is the end sample of the soft transition 
within segments. The complete synthesized lower speech signal $s(n)$ is then given 
by: 

$$ s(n) = \sum_{i=1}^{4} s_i(n) \qquad (21) $$



which is then optionally filtered by a lowpass filter 303 that, in this 
example, has a cut-off frequency of 300 Hz. In Equation (21), the summation range 
$i = 1, \ldots, 4$ is presented merely as an example. In practice, the range should 
be selected such that all of the generated sine tones are added together. The resultant output 
signal, $y_{low}(n)$, is given by: 

$$ y_{low}(n) = \sum_{k=0}^{P_{low}} h_{low}(k)\, s(n-k) \qquad (22) $$
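
By way of non-limiting illustration, the tone generation of Equations (20) and (21) may be sketched in Python as follows. The sketch assumes that the gain and the pitch frequency are interpolated linearly during the soft transition, that the pitch frequency is expressed in radians per sample, and that the phase compensation is supplied by the caller. The function and parameter names (`sine_tone_segment`, `lower_band_segment`, `g_prev`, `w_cur`, `L_t`, and so on) are hypothetical and do not appear in the described embodiment.

```python
import math


def sine_tone_segment(i, g_prev, g_cur, w_prev, w_cur, phi, L, L_t):
    """Return L samples of harmonic i for one segment (sketch of Equation (20)).

    g_prev, g_cur: gains of the previous and current segment (assumed names).
    w_prev, w_cur: pitch frequencies in radians per sample (assumed unit).
    phi: phase compensation for the current segment.
    L_t: last sample index of the soft transition, 0 < L_t < L.
    """
    samples = []
    for n in range(L):
        if n <= L_t:
            # Soft transition: move gain and pitch gradually from the
            # previous segment's values to the current segment's values.
            frac = n / L_t
            g = g_prev + frac * (g_cur - g_prev)
            w = w_prev + frac * (w_cur - w_prev)
        else:
            # Remainder of the segment: current gain and pitch are used as-is.
            g, w = g_cur, w_cur
        samples.append(g * math.sin(i * (phi + n * w)))
    return samples


def lower_band_segment(g_prev, g_cur, w_prev, w_cur, phi, L, L_t,
                       harmonics=(1, 2, 3, 4)):
    """Sum the harmonic tones sample by sample (sketch of Equation (21))."""
    tones = [
        sine_tone_segment(i, g_prev, g_cur, w_prev, w_cur, phi, L, L_t)
        for i in harmonics
    ]
    return [sum(column) for column in zip(*tones)]
```

In this sketch the two branches agree at sample `L_t`, so no amplitude or frequency step occurs at the end of the transition. The default harmonic range 1 to 4 merely mirrors the example range of Equation (21).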

Narrow-Band Speech Analyzer 101 

Referring now to FIG. 4, the narrow-band speech is estimated with a model 
of a linear prediction filter (linear predictor 407) and an excitation signal (see 
Equation (1)). 

The placement of the synthetic formant frequencies ($F_{Uk}$) in the upper 
frequency region is based on the estimated formant frequencies ($F_{Nk}$) in the 
narrow-band speech signal. The estimated linear prediction filter 407 has poles at 
the formant frequencies of the narrow-band speech signal. In preferred 
embodiments, the poles at the two highest frequencies, $F_{N(N-1)}$ and $F_{NN}$, are used 
in the analysis of the placement of the synthetic formants. The reason for this is 
that these estimated formant frequencies are most likely to be resonances of the 
same front-most tube. If this front-most tube is considered to be uniform, open in 
the front end, and closed in the back end, the resonances occur at, 

$$ f_n = \frac{2n-1}{4} \cdot \frac{c}{l}, \qquad n = 1, 2, 3, \ldots \qquad (23) $$



where $c$ = 354 m/s is the speed of sound at body temperature and 1 atmosphere pressure, and $l$ is the 
length of the tube. The parameters in Equation (23) can be estimated as follows: the resonance 
number $n$ is calculated from the average of the two formant frequencies, and $c/l$ is calculated from the frequency distance between them: 



$$ n_{N(N-1)} = \mathrm{round}\left( \frac{F_{N(N-1)} + F_{NN}}{2\,\bigl(F_{NN} - F_{N(N-1)}\bigr)} \right) \qquad (24) $$



$$ \frac{c}{l} = 2\,\bigl(F_{NN} - F_{N(N-1)}\bigr) \qquad (25) $$

The fraction $c/l$ is then also limited: a maximum tube length of 20 cm is a 
reasonable physical limit, which gives a lower limit of approximately 0.9 kHz on the 
distance between the resonance frequencies. The synthetic formant frequencies, $F_{Uk}$, are 
then calculated with Equation (23), for $n = n_{N(N-1)} + 2, n_{N(N-1)} + 3, \ldots$, 
corresponding to $F_{U1}, F_{U2}, \ldots$. 
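
By way of non-limiting illustration, the formant placement of Equations (23) to (25) may be sketched in Python as follows. The sketch assumes that the limit on $c/l$ is applied as a simple clamp derived from the maximum tube length; the text does not state how the limit is enforced. The names `synthetic_formants`, `count` and `max_tube_length` are hypothetical.

```python
C_SOUND = 354.0  # speed of sound in m/s, as given in the text


def synthetic_formants(f_second_highest, f_highest, count=2,
                       max_tube_length=0.20):
    """Place synthetic formants above the two highest narrow-band formants.

    f_second_highest, f_highest: the two highest estimated formant
    frequencies in Hz. Returns `count` synthetic formant frequencies in Hz.
    """
    distance = f_highest - f_second_highest
    if distance <= 0:
        raise ValueError("f_highest must exceed f_second_highest")

    # Equation (24): resonance number of the second-highest formant.
    n = round((f_second_highest + f_highest) / (2.0 * distance))

    # Equation (25): c/l from the distance between the two resonances.
    c_over_l = 2.0 * distance

    # Assumed clamp: a tube no longer than max_tube_length means that
    # c/l cannot fall below C_SOUND / max_tube_length.
    c_over_l = max(c_over_l, C_SOUND / max_tube_length)

    # Equation (23) evaluated for n + 2, n + 3, ...
    return [(2 * k - 1) / 4.0 * c_over_l for k in range(n + 2, n + 2 + count)]
```

For example, with estimated formants at 2500 Hz and 3500 Hz the sketch finds resonance number 3 and $c/l$ = 2000 Hz, and places the first two synthetic formants at 4500 Hz and 5500 Hz. With a 20 cm tube, the clamp corresponds to a resonance spacing of 354 / (2 × 0.20) = 885 Hz, which is the 0.9 kHz limit stated above.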

The detectors used in the analysis part are: a fricated speech activity 
detector (FAD 405), a voiced/unvoiced (pitch) decision maker (PAD 403), and a 
general voice activity detector (VAD 415). VADs 415 are well known, and need 
not be described here in great detail. A possible choice is the VAD used in the 
GSM AMR vocoder specification (see Voice Activity Detector (VAD) for 
Adaptive Multi-Rate (AMR) speech traffic channels, GSM 06.94, ver 7.1.1, ETSI, 
1998). The voiced/unvoiced decision is derived from a pitch frequency estimator. 
Pitch frequency estimators and detectors are also well known, and need not be 
described here in great detail. See, for example, W. Hess, Pitch determination of 
speech signals. Springer-Verlag, 1983. 

The fricated speech activity detector (FAD 405) is used to detect when the 
current speech segment contains fricative or affricate consonants. This can then be 
used to select a proper gain calculation method. The fricated speech activity 
detector is similar in structure to the linear gain estimation methods. The first 
stage in the detector calculates a linear combination, $\sigma$, of the estimated 
formant peak amplitudes, $A_{k,m}$, in the current segment as well as in the last $M - 1$ 
segments with pitch: 




(26) 



The estimated value $\sigma$ is low when the current segment contains fricated speech. 
An exponential average of $\sigma$ over segments with voiced speech is taken, forming 
$\bar{\sigma}$. When the estimated value $\sigma$ is below the average $\bar{\sigma}$, the segment is estimated to 
contain a fricated speech sound. 
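
By way of non-limiting illustration, the comparison of the detector value against its exponential average may be sketched in Python as follows. The sketch takes the value of Equation (26) as an input and does not compute it. The smoothing factor `alpha`, the initial average, and the choice to compare before updating the average are assumptions made for this sketch; the class and method names are hypothetical.

```python
class FricationDetector:
    """Flags a segment as fricated when its detector value is below average."""

    def __init__(self, alpha=0.9, initial_average=0.0):
        self.alpha = alpha              # assumed smoothing factor, 0 < alpha < 1
        self.average = initial_average  # running average over voiced segments

    def update(self, value, voiced):
        """Process one segment and return True if it is estimated as fricated.

        value: detector value of the current segment (output of Equation (26)).
        voiced: True if the pitch detector marks the segment as voiced.
        """
        # Compare against the average formed from earlier voiced segments.
        fricated = value < self.average
        # Only voiced segments contribute to the exponential average.
        if voiced:
            self.average = self.alpha * self.average + (1.0 - self.alpha) * value
        return fricated
```

A value of `alpha` close to 1 makes the average follow the voiced-speech level slowly, whereas a smaller value makes it track recent segments more closely.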

The upper-frequency-band speech synthesizer 103 uses different upper- 
band gains, depending on whether it is synthesizing an upper frequency-band 
signal for voiced speech, fricated speech, or neither voiced nor fricated speech. 
These situations can be determined with the above described detectors and control 
logic as 

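$$
\begin{cases}
\text{voiced}, & \mathrm{VAD}\ \&\ \mathrm{PAD} \\
\text{fricated}, & \mathrm{VAD}\ \&\ \overline{\mathrm{PAD}}\ \&\ \mathrm{FAD} \\
\text{neither}, & \overline{\mathrm{VAD}}\ |\ \bigl(\overline{\mathrm{PAD}}\ \&\ \overline{\mathrm{FAD}}\bigr)
\end{cases} \qquad (27)
$$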

where "&" represents a logical AND operator, " | " represents a logical OR 
operator, and a "bar" over a variable represents a logical NOT operator. 
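
By way of non-limiting illustration, the control logic of Equation (27) may be sketched in Python as follows. The sketch assumes that the fricated case requires the pitch detector to be inactive, and that the "neither" case covers every remaining combination of detector outputs. The function name `classify_segment` is hypothetical.

```python
def classify_segment(vad: bool, pad: bool, fad: bool) -> str:
    """Map the three detector outputs to a gain-selection class.

    vad: general voice activity detector output.
    pad: voiced/unvoiced (pitch) decision.
    fad: fricated speech activity detector output.
    """
    if vad and pad:
        return "voiced"
    if vad and not pad and fad:
        return "fricated"
    # Remaining cases: no voice activity, or neither pitch nor frication.
    return "neither"
```

Because the first two conditions are mutually exclusive, exactly one class is selected for every combination of detector outputs.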

The invention has been described with reference to a particular 
embodiment. However, it will be readily apparent to those skilled in the art that it 
is possible to embody the invention in specific forms other than those of the 
preferred embodiment described above. This may be done without departing from 
the spirit of the invention. 

For example, the upper-band speech synthesizer 103 could be embodied in 
ways other than the exemplary embodiment described with respect to FIG. 2. In 
one alternative, the bandpass filter 205 is eliminated entirely, with the output of 
the spectrum copy unit 203 being supplied directly to the formant filter 211. This 
is a viable alternative because a reduction below 3400 Hz can be accomplished 
with the formant filter 211, and during fricated speech periods (i.e., when the 
output of the formant filter is not selected) this reduction is not very important. 

In another alternative of the upper-band speech synthesizer 103, the 
bandpass filter 205 is replaced by a highpass filter. 

In yet another alternative of the upper-band speech synthesizer 103, the 
spectrum copy unit 203 is replaced by a spectrum move unit that first performs the 
copying function and then zeroes out the section that has been copied. 

In still another alternative of the upper-band speech synthesizer 103, the 
bandpass filter 205 and formant filter 211 can be eliminated entirely. If the 
content below 3400 Hz is left unattenuated in the upper-band synthesis signal, 
it is quite disturbing to the listener; it could nevertheless be left in place, at the cost of a 
clear degradation in speech quality. 

The tube model of the vocal tract upon which the above-described 
embodiments are based is a simple one. In yet other alternative embodiments, 
those skilled in the art will readily be able to apply the same principles set forth 
above in an application based on a more advanced tube model. 

Furthermore, in the description of the FAD and the gains, as set forth 
above, the terms "proportional" and "linear" are used. However, in still other 
alternatives, non-linear processing may be used instead. This may be performed, 
for example, by means of an artificial neural network (ANN) configured as, for 
example, a feed-forward back-propagation network or a radial basis network. One ANN 
takes the $A_{k,m}$ as input and generates the $g_u$ of Equation (16) as output. Yet 
another ANN takes the $A_{k,m}$ as input and generates $\sigma$ of Equation (26) as output. 

Finally, it is additionally noted that, in embodiments in which the lower- 
band synthesis is performed without the upper-band synthesis, there is no need for 
an up-sampling of the narrow-band signal. 

Thus, the preferred embodiment is merely illustrative and should not be 
considered restrictive in any way. 



